Creating a Model with the Base Variables

Using backwards elimination on the full model, we repeatedly removed variables with p-values greater than 0.1 to create an efficient model. We arrived at the model gdp ~ bir + lem + gro + dep, which has an AIC of 1877 and an adjusted R^2 of 0.7867 (AIC could be slightly lower and adjusted R^2 slightly higher with additional variables, but only minimally so). The step function also retained the age variable, which gave a small decrease in AIC and a small increase in adjusted R^2.

Nevertheless, neither model suffices as our residual plot features clear indicators of unequal variance and a lack of model fit.

## 
## Call:
## lm(formula = gdp ~ bir + lem + gro + dep, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11281.9  -1487.4   -380.6    853.3  14227.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2763.92    7440.60  -0.371 0.711184    
## bir           166.24     116.08   1.432 0.155661    
## lem           170.06      92.45   1.839 0.069225 .  
## gro2         1464.64    1836.97   0.797 0.427415    
## gro3        15213.05    1439.34  10.569  < 2e-16 ***
## gro4         6789.72    2168.06   3.132 0.002360 ** 
## gro5         2006.81    1719.61   1.167 0.246355    
## gro6         4252.92    2271.75   1.872 0.064516 .  
## dep          -118.02      33.34  -3.540 0.000642 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3714 on 88 degrees of freedom
## Multiple R-squared:  0.7985, Adjusted R-squared:  0.7802 
## F-statistic:  43.6 on 8 and 88 DF,  p-value: < 2.2e-16
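The elimination procedure described above can be sketched in a few lines of R. This is a minimal illustration on the built-in mtcars data, not the code used for this report: the helper name backward_eliminate and its alpha argument are hypothetical, and it tests each term with drop1() and an F-test (so a factor such as group is kept or dropped as a whole) rather than reading per-coefficient p-values from summary().

```r
# Backward elimination by p-value, sketched on the built-in mtcars data.
# backward_eliminate() and alpha are illustrative names, not from the report.
backward_eliminate <- function(fit, alpha = 0.1) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(fit), "term.labels")) == 0) break
    # drop1() gives one F-test per term, so a factor is tested as a whole
    tests <- drop1(fit, test = "F")
    pvals <- tests[["Pr(>F)"]][-1]          # first row is "<none>"
    candidates <- rownames(tests)[-1]
    if (max(pvals) <= alpha) break
    # Remove the least significant term and refit
    worst <- candidates[which.max(pvals)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

cars <- transform(mtcars, cyl = factor(cyl))
full <- lm(mpg ~ wt + hp + qsec + drat + cyl, data = cars)
reduced <- backward_eliminate(full, alpha = 0.1)

formula(reduced)
AIC(reduced)
summary(reduced)$adj.r.squared
```

Removing one term at a time and refitting matters because p-values change whenever a correlated variable leaves the model.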

Transforming GDP

With the goal of producing a more reasonable residual plot, we transformed the response variable, GDP. A log transformation proved the most useful. Again using backwards elimination with a cutoff of 0.1, we end up with

out = lm(log(gdp) ~ lef + gro + dep, data)

It has an AIC of 209.9305 and an adjusted R^2 of 0.833.

The residual plot is much nicer, but its right end shows lower variance than the rest of the plot.

## 
## Call:
## lm(formula = log(gdp) ~ lef + gro + dep, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54470 -0.27259  0.00888  0.32990  1.65535 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.047842   1.311171   3.087 0.002694 ** 
## lef          0.064662   0.014146   4.571 1.56e-05 ***
## gro2         0.560189   0.316839   1.768 0.080479 .  
## gro3         1.794455   0.261095   6.873 8.31e-10 ***
## gro4         2.026960   0.333589   6.076 2.99e-08 ***
## gro5        -0.023556   0.301106  -0.078 0.937819    
## gro6         1.015495   0.369024   2.752 0.007181 ** 
## dep         -0.016768   0.004791  -3.500 0.000728 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6794 on 89 degrees of freedom
## Multiple R-squared:  0.8432, Adjusted R-squared:  0.8308 
## F-statistic: 68.36 on 7 and 89 DF,  p-value: < 2.2e-16
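The effect of a log transformation on the residual plot can be illustrated with simulated data. The sketch below is an assumption-laden example, not the report's data: x and y are made-up variables with multiplicative noise. It also shows a standard adjustment, adding 2 * sum(log(y)), that puts the AIC of a log-response model on the same scale as the AIC of a raw-response model, since the two are not directly comparable otherwise.

```r
# Simulated example (hypothetical data): a positive, right-skewed response
set.seed(1)
n <- 100
x <- runif(n, 40, 80)
y <- exp(2 + 0.08 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(y ~ x)        # untransformed response
fit_log <- lm(log(y) ~ x)   # log-transformed response

# Residuals vs fitted values, side by side
op <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "Raw response",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
plot(fitted(fit_log), resid(fit_log), main = "Log response",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
par(op)

# AIC on the scale of the raw response for both models
AIC(fit_raw)
AIC(fit_log) + 2 * sum(log(y))
```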

Including Interactions

In an aim to increase the effectiveness of our model, we added interaction terms. Intuitively, the most logical interactions are those between country group and our base variables, since the effect of each variable can differ from group to group. Thus, we added interactions between group and every other variable. From there, we used backwards elimination with 0.1 as the cutoff. As before, we opted for model efficiency over minuscule quality increases: some larger models have slightly lower AICs and slightly higher adjusted R^2s, but for the sake of parsimony we opt for the smaller model.

We find the model shown below, with an AIC of 207.171.

## 
## Call:
## lm(formula = log(gdp) ~ dea + gro + pop + dep + gro * dea, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50923 -0.24043  0.01313  0.37715  1.43415 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.891e+00  1.125e+00   7.014 5.74e-10 ***
## dea          1.583e-01  9.939e-02   1.592  0.11513    
## gro2         1.770e+00  1.132e+00   1.563  0.12181    
## gro3         3.399e+00  1.453e+00   2.339  0.02173 *  
## gro4         3.627e+00  1.208e+00   3.001  0.00355 ** 
## gro5         2.458e+00  1.141e+00   2.154  0.03413 *  
## gro6         3.813e+00  1.152e+00   3.311  0.00138 ** 
## pop         -1.099e-06  4.733e-07  -2.321  0.02271 *  
## dep         -2.595e-02  4.167e-03  -6.229 1.85e-08 ***
## dea:gro2    -1.167e-01  1.062e-01  -1.099  0.27491    
## dea:gro3    -1.287e-01  1.433e-01  -0.898  0.37187    
## dea:gro4    -1.590e-01  1.320e-01  -1.205  0.23177    
## dea:gro5    -2.639e-01  1.072e-01  -2.462  0.01589 *  
## dea:gro6    -2.828e-01  1.033e-01  -2.738  0.00756 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.652 on 83 degrees of freedom
## Multiple R-squared:  0.8653, Adjusted R-squared:  0.8442 
## F-statistic: 41.02 on 13 and 83 DF,  p-value: < 2.2e-16

Its AIC and adjusted R^2 are nearly the same as those of the model without interactions, which counts against this model because it has more variables. However, it has better plots, with a less pronounced fan shape in the residual plot and a better normal plot. Choosing between the two is a matter of efficiency versus soundness of model. The model with interactions is marginally better in every way, but at the expense of simplicity, again marginally.
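A comparison of this kind, between a model with and without a group interaction, can be sketched as follows. The example uses the built-in mtcars data with cyl standing in for the group factor; the variable choices are assumptions for illustration only.

```r
# cyl plays the role of the group factor in this sketch
cars <- transform(mtcars, cyl = factor(cyl))

base_fit <- lm(log(mpg) ~ wt + cyl, data = cars)
int_fit  <- lm(log(mpg) ~ wt * cyl, data = cars)  # adds one wt slope per group

# Partial F-test for the whole block of interaction terms
anova(base_fit, int_fit)

# Same response and same rows, so AIC and adjusted R^2 are comparable
c(base = AIC(base_fit), interaction = AIC(int_fit))
c(base = summary(base_fit)$adj.r.squared,
  interaction = summary(int_fit)$adj.r.squared)
```

The partial F-test judges the interaction as a block, which avoids reading too much into the p-value of any single group-by-variable coefficient.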

As we can see from both of our complete models, the cluster at the right of the residual plot with the smallest variance comes from our combo group. Along with this comes a clear indication that group is significantly related to GDP per capita: our group factor is highly significant both as a base variable and in interactions. Our next step is to isolate the groups and evaluate models that best fit each.
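Isolating the groups amounts to splitting the data on the group factor and fitting a separate model to each piece. A minimal sketch, again on mtcars with cyl as a stand-in group and an arbitrary formula chosen for illustration:

```r
cars <- transform(mtcars, cyl = factor(cyl))

# One data frame per group, then one fit per data frame
by_group <- split(cars, cars$cyl)
fits <- lapply(by_group, function(d) lm(mpg ~ wt + hp, data = d))

# Size and fit quality of each group's model
sapply(fits, function(f) {
  c(n = nobs(f), adj_r2 = summary(f)$adj.r.squared, AIC = AIC(f))
})
```

AIC values from different groups are computed on different observations, so they describe each model's fit to its own group and should not be ranked against one another.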

Creating Individual Models for Each Group

EVALUATING EASTERN EUROPE

As we have with every model thus far, we use p-value backwards elimination with a 0.1 cutoff to create our favored models.

outest = lm(gdp~dea+lem+pop+age+dep, est)

We arrive at this model which features a few of the base variables and a markedly high adjusted R^2 value of 0.9925 and an AIC of 159.6315. The step function creates the same model with infant mortality and female life expectancy still present, rendering an AIC of 157. However, both of those variables had p-values higher than .1 and thus create a slightly better, but notably less efficient model. So we stick with our backwards elimination-bred structure.

What is notable here, however, is the effect of East Germany on the model. It has a large GDP per capita in comparison to the rest of Eastern Europe and is therefore very influential. The normal plot deviates slightly from a straight line. Residuals are spread evenly in the cluster of countries with smaller GDP. East Germany, however, is fitted rather precisely, as its small residual shows.

To test the contribution that the former East Germany makes to the model, we fit the same formula without it. Removing East Germany lowers the adjusted R^2 to 0.9506 and lowers AIC to 144.9833, so the model and its components remain sound. The residual plot remains similar, with the smaller-GDP countries showing a more even spread and the higher-GDP countries carrying more influence and therefore being more accurately predicted. The normal plot is much straighter, however.

  INT  DEA   LEM      POP    AGE     DEP
 8387 1179   493 -0.03208 -908.4 -291.67
12180 1498 566.1   -372.7  -1158  -341.5

Multicollinearity proves a non-factor:

     dea      lem      pop      age      dep
1.928555 1.518601 2.101974 2.590852 2.227930
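Variance inflation factors like these can be computed in base R without extra packages. The sketch below is illustrative: vif_manual is a hypothetical helper, it assumes at least two numeric predictors (no factors), and it uses mtcars rather than the report's data. Each VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others.

```r
# vif_manual() is an illustrative helper, assuming numeric predictors only
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
round(vif_manual(fit), 3)
```

A value near 1 means a predictor is nearly uncorrelated with the rest; common rules of thumb start to worry somewhere between 5 and 10.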

STUDYING SOUTH AMERICA

Our favored model through backwards elimination is

lm(formula = gdp ~ lef + pop + age + dep, data = sam)

This has an adjusted R^2 of 0.7459 and an AIC of 185.0144. Again, step allows for a lower AIC but at the expense of retaining inefficient variables. The diagnostic plots suggest an inconsistency in the model. Although the number of observations is small, the residual plot is slightly concerning in that 8 of the 12 points are negative. The normal plot is offset at the rightmost points. Most telling is the Cook’s distance of Bolivia, which is well over 1 given a large GDP relative to its comparatively underwhelming predictor values.
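Checking Cook’s distance and refitting without the most influential observation can be sketched as below. The example uses mtcars with an arbitrary formula, and it drops the single largest Cook’s distance purely for illustration; the rule applied in this report is a Cook’s distance above 1.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation, largest first
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
names(cd)[cd > 1]                      # observations over the usual threshold

# Refit without the most influential row and compare coefficients
most <- names(which.max(cd))
refit <- lm(mpg ~ wt + hp, data = mtcars[rownames(mtcars) != most, ])
rbind(all_rows = coef(fit), without_most = coef(refit))
```

Large shifts in the coefficients after removal indicate that the single observation was driving the fit.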

Removing Bolivia yields an improved model with an adjusted R^2 of 0.8824 and an AIC of 160.6462. The predictor variables remain significant. We test whether Bolivia was an outlier with predict, using a Bonferroni-adjusted level:

predict(outsam, data.frame(lef = 55.4, pop = 7132, age = 19.1, dep = 128.3), interval = "predict", level = 1-(.05/12))

        fit       lwr      upr
1 -1323.752 -4071.533 1424.029

This illustrates that Bolivia fits the new model (its GDP of 630 falls inside the interval); the issue is its leverage more so than its GDP disparity. Nevertheless, our Bolivia-less model has an improved residual plot and a better-behaved normal plot.
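The outlier check used here follows a general pattern: fit the model without the suspect observation, then ask whether its observed response falls inside a Bonferroni-adjusted prediction interval. A minimal sketch on mtcars, where the suspect row is picked by its standardized residual purely for illustration:

```r
# Pick a suspect row (illustrative choice: largest standardized residual)
fit_all <- lm(mpg ~ wt + hp, data = mtcars)
suspect <- names(which.max(abs(rstandard(fit_all))))

held_out <- mtcars[suspect, ]
rest <- mtcars[rownames(mtcars) != suspect, ]
fit <- lm(mpg ~ wt + hp, data = rest)

# Bonferroni adjustment: divide alpha by the number of observations screened
m <- nrow(mtcars)
pred <- predict(fit, newdata = held_out,
                interval = "prediction", level = 1 - 0.05 / m)
pred

# TRUE means the held-out response is consistent with the refitted model
inside <- held_out$mpg >= pred[, "lwr"] & held_out$mpg <= pred[, "upr"]
inside
```

Dividing by the number of observations accounts for the fact that any of them could have been the one singled out.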

Based on these data, we’d consider the model without Bolivia to be superior, but note that there is something to consider regarding an additional factor, unrelated to population, that has affected Bolivia’s GDP per capita.

Multicollinearity proves a non-factor:

     lef      pop      age      dep
3.149656 1.215085 5.004212 4.563514

CONTEMPLATING COMBO GROUP

Our best model through backwards elimination is below:

What is immediately evident through the residual plots of our models is Switzerland’s potential as an outlier. Our variables at hand do not consider details such as banking revenue, or some other outlet Switzerland has that leads it to such high GDP per capita levels. In terms of these population-based variables, Switzerland is not elite.

To prove its outlier status, we remove it from the best model involving Switzerland (GDP ~ Birth + Infant Mortality + Median Age) and run a Bonferroni-adjusted prediction interval.

predict(outcom, data.frame(bir = 12.5, inf = 7.1, pop = 6814, age = 36.9), interval = "prediction", level = 1-(0.5/19))

       fit     lwr      upr
1 20083.67 12556.4 27610.93

Switzerland’s observed GDP (34064) lies well above the upper bound of this interval, so it is clearly influenced by other variables and is not confidently predicted by this model. Removing Switzerland improves both the predictive ability of the model and the shape of the residual plot.

VIF values:

     bir      inf      pop      age
1.354181 1.123637 1.030949 1.374317

SUMMARISING SOUTHWEST ASIA

Southwest Asia proved difficult as a model subject.

MEASURING MAINLAND ASIA

Our best untransformed mainland Asia model, found through backwards elimination, re-introduces our original problem of an incorrect approach. The data need to be transformed in some way, as every untransformed model shows the same skew seen below.

## 
## Call:
## lm(formula = gdp ~ bir + dea + mig + age, data = asi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2875.3 -1005.0  -312.7   723.8  4616.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -35064.00   12986.82  -2.700  0.01931 * 
## bir            398.40     199.13   2.001  0.06856 . 
## dea           -433.54     167.71  -2.585  0.02388 * 
## mig            209.31      60.73   3.447  0.00484 **
## age           1327.89     368.24   3.606  0.00361 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2058 on 12 degrees of freedom
## Multiple R-squared:  0.8113, Adjusted R-squared:  0.7484 
## F-statistic:  12.9 on 4 and 12 DF,  p-value: 0.0002651

As a result, we return to the log transformation of GDP, which again produces an improved model and residual plot.

Afghanistan has a Cook’s distance above 1, which is worrying; removing it improves the model slightly but changes little overall.

ANALYZING AFRICA

Using an untransformed analysis, we get a rather ineffective model with a low R^2 and a high AIC.

## 
## Call:
## lm(formula = gdp ~ lef + pop + age + dep, data = afr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1091.82  -418.10   -86.45   380.63  3154.17 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.630e+04  1.490e+04   1.765  0.09143 . 
## lef          1.004e+02  2.955e+01   3.398  0.00258 **
## pop         -1.341e-02  7.606e-03  -1.763  0.09183 . 
## age         -9.259e+02  4.978e+02  -1.860  0.07631 . 
## dep         -1.015e+02  4.774e+01  -2.127  0.04491 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 852.6 on 22 degrees of freedom
## Multiple R-squared:  0.4929, Adjusted R-squared:  0.4007 
## F-statistic: 5.346 on 4 and 22 DF,  p-value: 0.003663

On top of that, the variables show pronounced multicollinearity (the VIFs for age and dep are both above 16). So we scrap it and work with a log transformation again. The new model is nicer, despite amounting to one variable. Again, it proves difficult to fully explain the GDP of Africa’s nations given population variables alone. The adjusted R^2 and AIC are weak compared to the other country groups.

##       lef       pop       age       dep 
##  1.545762  1.051320 18.238049 16.629151
## 
## Call:
## lm(formula = log(gdp) ~ lef, data = afr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5575 -0.4801 -0.1066  0.4896  1.4855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.45429    1.13297   0.401    0.692    
## lef          0.10610    0.02076   5.111  2.8e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7446 on 25 degrees of freedom
## Multiple R-squared:  0.511,  Adjusted R-squared:  0.4914 
## F-statistic: 26.12 on 1 and 25 DF,  p-value: 2.797e-05